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Abstract 

The present paper proposes an adaptive biasing potential for the computation of 
^ ■ free energy landscapes. It is motivated by statistical learning arguments and unifies 

the tasks of biasing the molecular dynamics to escape free energy wells and estimat- 
ing the free energy function, under the same objective. It offers rigorous convergence 
diagnostics even though history dependent, non-Markovian dynamics are employed. 
It makes use of a greedy optimization scheme in order to obtain sparse representa- 
tions of the free energy function which can be particularly useful in multidimensional 
cases. It employs embarrassingly parallelizable sampling schemes that are based on 
adaptive Sequential Monte Carlo and can be readily coupled with legacy molecular 
dynamics simulators. The sequential nature of the learning and sampling scheme 
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enables the efficient calculation of free energy functions parametrized by the temper- 
ature. The characteristics and capabilities of the proposed method are demonstrated 
in three numerical examples. 

Key words: free energy computations, adaptive biasing potential, Sequential 

Monte Carlo, atomistic simulations, statistical learning 
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1 Introduction 

Free energy is a central concept in thermodynamics and in the study of several 
systems in biology, chemistry and physics [8]. It represents a rigorous way to 
coarse-grain systems consisting of very large numbers of atomistic degrees of 
freedom, to probe states not accessible experimentally, to characterize global 
changes as well as investigate relative stabilities. In most applications, a brute- 
force computation based on sampling the atomistic positions is impractical or 
infeasible as the free energy barriers to overcome are so large that the system 
remains trapped in metastable free energy sets [10P2|8|56] . 




Equilibrium techniques for computing free energy surfaces such as Thermo- 
dynamic Integration [29] or Adaptive Integration |^9|2l] require the simula- 
tion of very long atomistic trajectories in order to achieve equilibrium and 
lack convergence diagnostics. Techniques based on non-equilibrium path sam- 
pling [26|27f23|T25] lack adaptivity and require the user to specify a particular 
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path on the reaction coordinate space connecting two energetically impor- 
tant free energy regions, which can be non-trivial a task [20]. Furthermore, 
sampling along these paths correctly might necessitate advanced and quite 
involved techniques [10]. More recently proposed adaptive biasing potential 
[2|55|32|[Tiri8j and adaptive biasing force [TT|12|41|53|24] techniques are ca- 
pable of dynamically utilizing information obtained from the atomistic tra- 
jectories to bias the current dynamics in order to facilitate the escape from 
metastable sets [35] . They are able to automatically discover important regions 
of the reaction coordinate space. Since they rely on history-dependent, non- 
Markovian dynamics, it is not a priori clear, and in which sense, the system 
reaches a stationary state, although some work has been done along theses 
lines in [I] for Langevin-type systems and [3"5f3"5] . 

We propose an adaptive biasing potential technique where the two tasks of 
biasing the dynamics and estimating the free energy landscape are unified 
under the same objective of minimizing the Kullback-Leibler divergence be- 
tween appropriately selected distributions on the extended space that includes 
atomic coordinates and the collective variables [37|38] . This framework pro- 
vides a natural way for selecting the basis functions used in the approximation 
of the free energy and obtaining sparse representations which is critical when 
multi-dimensional collective variables are used. It allows the analyst to utilize 
and correct any prior information on the free energy landscape and provides 
an efficient manner of obtaining good estimates at various temperatures. The 
scheme proposed is embarrassingly parallelizable and relies on adaptive Se- 
quential Monte Carlo procedures which enable efficient sampling from the 
high-dimensional and potentially multi- modal distributions of interest. 
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2 Methodology - A statistical learning approach for adaptively cal- 
culating free energies 

For clarity of the presentation, we will first introduce our method for the 
so-called alchemical case and generalize it later for the reaction coordinate 
case. Consider a molecular system with generalized coordinates q G M. C 
M, N following a Boltzmann-like distribution which in turn depends on some 
parameters z G T> C M. d 

p(q\z) oc exp(-(3V{q;z)) (1) 

where V(q;z) is the potential energy of the system and (3 plays the role 
of inverse temperature. The free energy A(z) is defined, up to an additive 
constant, by: 

A(z) = -/T 1 / exp (-pV(q) z)) dq (2) 
Our goal is to compute the function A(z) over the whole domain T>. 

Let A{z\ 0) be an estimate of A(z) parametrized by 9 G C M. K . We adopt 
a statistical perspective of learning A{z) from simulation data. A popular 
approach to carrying out regression tasks and functional approximations relies 
on kernel models Kernel regression models have proven successful in high- 
dimensional scenaria where d is in the order of 10 or 100 [5T|l52] . The unknown 
function is selected from a Reproducing Kernel Hilbert Space (RKHS) Hk 
induced by a semi-positive definite kernel K(-,-). We adopt representations 
with respect to a kernel function K(-,-) [28] : 

A(z- i e) = J20 J K J (z,z j ), zeV (3) 

where Zj are points in T> which are selected as described in the sequence, 
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In order to 
i(z;0) = 0. 



ix the additive constant, we select a point zq G T> such that: 
3 



In relevant literature different types of kernel functions have been used such 
as thin plate splines, multiquadrics, or Gaussians. While all these functions 
can be employed in the framework presented, we focus our discussion here on 
Gaussian kernels which also have an intuitive parametrization with regards 
to the scale of variability of A as quantified by the bandwidth parameters 
Tj = {rj t i}f =1 in each dimension: 

d 

K j( z ) = K ( z , z h T i) = ex P{- S T iM ~ z o>if} ( 4 ) 

1=1 

Gaussian kernels in the context of free energy approximations have also been 
used in I321I37II18I. 



We define a joint probability distribution on the generalized coordinates q and 
the parameters z as follows: 

p(*,z\0) = ^Mz)e-e( v M-*'*>) (5) 

where lv{z) is the indicator function on T> and Z(0) is the normalization 
constant, i.e.: 

Z{0) = J l v {z)e-^ v ^ z) - A{z ' e) ) dzalO (6) 

It is noted that the first-order partial derivatives of the log of the normalization 
function give rise to the expectation of the respective kernel: 

dlogZ 1 dZ 



d6j Z[p) d6j 



(3E p{zle) [K 3 {z)\ (7) 



1 This is always possible by changing the kernels in Equation [3] to Kj(z,Zj] 
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whereas the second-order derivatives, produce the covariance between the ker- 
nels: 



d 2 log Z _ 1 dZ dZ j 1 d 2 Z 

ae J de l z 2 {o) ae 3 ae t z(o) de j ae l 

= pE v(m [(K,(z) - E p ( g \e)[Kj(z)])(Ki(z) - E^K^z)}) 

= P 2 Cov p{zle) [K j} Ki\ 

The expectations in the two equations above involve the unknown marginal 
density with respect to the parameters z6D: 



(9) 



m L v {z)e v > 



Z(6)- 

which depends on the unknown free energy A(z) of Equation (j2J). 

The key property of p(z \ 6) is that it reduces to the uniform distribution for 
z G T> if and only if the free energy estimate is exact i.e. A(z; 6) = A(z), z G 
V. 

If ir(z) = 1t>(z)-^ is the uniform density on T> (whose volume is denoted by 
| T> |), then a natural strategy to estimate A[z) is by minimizing a distance 
metric between 7r(z) and p(z | 6) in Equation ([9]) above. To that end, we 
propose employing the Kullback-Leibler (KL) divergence KL(7r(z) || p(z | 6)) 

KL(tt II p )=j 7T (z)\og^dz 

= J tt(z) \ogn(z)dz - J tt(z) \ogp(z \ 6) dz ( 10 ) 
= - \og\V \ - J tt(z) logp(z | 0) dz > 
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The latter is not a metric in the mathematical sense, but it is frequently used 
as a measure of the distance between two probability distributions. It is always 
non-negative and becomes zero if and only if ir(z) = p(z | 0) or equivalently 
A(z;0) = A(z), 2 £ D 0. The aforementioned formulation offers a clear 
strategy for estimating the free energy by minimizing the following form with 
respect to 0: 

1(0) = - J tt(z) \ogp(z | 0)dz (11) 

Since the KL-divergence is always non-negative, the formulation above pro- 
vides a lower bound on the objective function 1(0): 

1(0) > log | V | (12) 

which can be readily calculated and be used to monitor convergence as well 
as the quality of the approximation obtained. 

Even though 1(0) depends on the unknown free energy A(z) (from Equation 
(ED): 



1(0) = - Jn(z)\ogp(z | 0)dz 



(13) 



J tt(z) (A(z) - A(z- 0)) dz + log Z(0) 



its partial derivatives J{0) = do not , i.e. from Equation (JTJ): 



JA0) = m 



-PE. 



7r(z) 



dA 



do, 



+ 



dlogZ (14) 



-P [E <a) [Kj(z)] - E pm [K 3 (z)}) 



2 Of interest are free-energy differences and therefore perturbations of A(z) or 
A(z; 0) by a constant are ignored 
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where ^( z )[-] implies an expectation with regards to vr(z). 

It is important to note that according to Equation (jSj), the Hessian of the 
objective function H(6) = qq^qt is proportional to the covariance between 
the kernels i.e.: 

a 2 i diogzje) 

(15) 

= (3 2 Cov p{zm [K JJ K l ] 
Hence the objective function is convex with respect to 6 and there is a unique 
minimum. 

Furthermore, the approximation of the free energy A(z; 0), biases the poten- 
tial of p(q, z | 6) (Equation dSJ)) and allows the system to overcome free energy 
barriers [39]. As in [18] . no binning is needed and the bias potential is non- 
local, providing information about the free energy landscape not only at the 
states visited but in their neighborhood as well. In contrast to other adaptive 
schemes, the proposed formulation connects the problems of estimating the 
free energy landscape and steering the atomistic dynamics beyond metastable 
wells, under a unified umbrella, and provides a clear convergence criterion [53] . 

From an algorithmic point of view, the proposed strategy poses two problems. 
The first involves the selection of the kernels to be used in the expansion of 
Equation This is critical to the sparseness of the representation obtained, 
particularly for multidimensional z. To that end, a greedy selection strategy 
is discussed in section [2721 which progressively adds kernels (i.e. increases the 
cardinality K of the expansion in Equation 03}) as needed. The second prob- 
lem involves the optimization of the objective function 1(6) which depends 
on the unknown free energy A(z) and the intractable partition function Z(6) 
(Equation (fl"3l)). An obvious approach is by gradient descent which is discussed 
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in detail in section 12.11 This requires the computation of expectations with 
respect to p(z \ 6) (Equation (fl4l)). The intractability of p(z \ 6) necessitates 
the use of Monte Carlo sampling which must be carried out in the expanded 
space with respect to the joint density p(q, z \ 6) (Equation ([5])). This should 
nevertheless be able to capture multiple modes in the high- dimensional state 
space consisting of atomic degrees of freedom q and parameters z. To that 
end we propose performing this step by using non-equilibrium path sampling 
techniques based on adaptive Sequential Monte Carlo schemes discussed in 
section 12.31 Similar schemes for creating system replicas in parallel have been 
employed in [^T|35j . We discuss a novel adaptive version that retains previ- 
ous advantages while providing accurate estimates at reduced computational 
effort. These estimates can be readily updated as 6 changes after each opti- 
mization step. 

A discussion of each of the aforementioned algorithmic modules is contained 
in the ensuing sub-sections. The steps of the scheme proposed can be found 
in Algorithm [2J 

2.1 Optimization with noisy gradients 

We propose employing a gradient descent scheme in order to determine 0, 
although more involved procedures such as Improved Iterative Scaling [3|T6j . 
noisy conjugate gradients |32] can be also be employed. Second-order (quasi- 
) Newton techniques are also possible although the unavoidable Monte Carlo 
noise in the computation of the Hessian (i.e. the covariance in Equation ( fT5|) ) 
can destroy its positive definiteness. 
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Let 0k denote the vector of kernel amplitudes (Equation Q) when K such 
kernels are used. Let also 0K,m denote the estimate of Ok after m iterations of 
the gradient descent algorithm. Then at the (m + 1)— iteration, the following 
update equation could be used: 

0K,m+l = &K,m ~ A J(#fC m ) (16) 

where A is the learning rate. 

Since only a noisy Monte Carlo estimate of the gradients J{0) (Equation 
(TH1) ) is available, it is anticipated that the noise could impede convergence. 
For that purpose we propose employing a stochastic approximation variant of 
the Robbins & Monro scheme [13117] . Rather than increasing the simulation 
size in order to reduce the variance, we compute a weighted average of the 
gradient's estimates at the current and previous iterations. By employing a 
decreasing sequence of weights, information from the earlier iterations gets 
discarded gradually and more emphasis is placed on the recent iterations. 
As it is shown in [TTJ, this method converges with a fixed sample size. In 
particular, if JiQK.m) denotes the Monte Carlo estimate of the gradient (the 
details of this estimator are discussed in section 12731) at the m th iteration, then 
we calculate: 

J m = (1 - r) m )Jm-l + VmJ (0K,m) (17) 

and, rather than Equation ffT6l) . we update 6 as follows: 

0ff,m+l = ®K,m — A J m (18) 

where the sequence of weights {r] m } is such that J2m=i Vm = oo and Z)m=i Vm < 
oc 3 . 



A family of such sequences that was used in this work is rj m = rjm p with 
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2.2 Kernel selection - Sparse representation of free energy landscape 



A critical objective in the proposed framework relates to the sparseness of the 
free energy approximation i.e. the cardinality K of the expansion in Equation 
([3]). This is important in at least two ways. Firstly, because sparser representa- 
tions can more clearly expose salient features of the free energy landscape, and 
as a consequence, of the atomistic ensemble considered. Secondly, because they 
reduce the number of parameters 6 with respect to which the optimization 
problem needs to be solved (section 12. II) . Given a vocabulary of potentially 
overcomplete basis functions and a prescribed K, the problem amounts to 
identifying those kernels (Equation ([3])) that best approximate the true free 
energy surface i.e. minimize the KL divergence for z £ T> (Equation ffTUj) ). 
This obviously implies an excessive computational effort since the aforemen- 
tioned optimization problem would need to be solved for all possible K— sized 
combinations of basis functions. 

For that purpose, we propose a hierarchical scheme that proceeds by adding a 
single kernel at each step. Similar greedy optimization procedures have been 
successfully applied in maximum entropy problems [16] . Without loss of gener- 
ality, one can consider a vocabulary of functions that consists of the isotropic 
Gaussian kernels discussed in Equation fll]). Each of these is parametrized by 
the location Zj of the kernel and its bandwidth Tj. Given K such kernels, 
the corresponding parameters 6 k = {0j}f = i (Equation ([3])) that minimize 
1(6) in Equation ( ITT1) and samples from the density p(z | 6k) (Equation 
OH]) or Equation ((5])), we propose selecting the (K + l) th kernel by choosing 



1/2 <p < 1. 
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(z k+1 ,t k+ i = {r K+lt i}f =1 ^ that maximize: 

(z K +i, t k+ i) = arg max E^ z) [K(z, z K+1 ; t k +i)} - E P {«\e K ) [K(z, z K+1 ; t k +i)] 

( z K+l,TK+l) 

(19) 

Based on Equation $H]), this suggests augmenting our expansion with the 
kernel that locally maximizes the gradient of 1(0). Intuitively this means that 
we incorporate the kernel function whose expected value with respect to the 
target, uniform distribution is worst approximated by the current density p(z | 
0k)- It is obviously a suboptimal strategy, that is necessitated by reasons of 
computational cost. The maximization of the objective in Equation (1191) can be 
readily carried out given samples fromp(z | 0r)- The same formulation can be 
applied to any type of kernel or overcomplete basis employed (e.g. wavelets). 
The proposed strategy promotes sparseness and computational efficiency while 
offering a progressive resolution of the free energy landscape that naturally 
involves kernels with larger bandwidths (smaller r) in the first steps, and 
successive unveiling of the finer details which can be captured by kernels of 
smaller bandwidths (i.e larger r). 

Furthermore, it offers a rigorous metric for monitoring convergence. In partic- 
ular if Ak(z\0k) and Ak+i(z; 0k+i) denote the free-energy approximations 
obtained at successive steps using K and K + 1 kernels respectively, and px 
and Pk+i the corresponding densities in Equation then the improvement 
in terms of Kullback-Leibler divergence (Equation ({TO]) ), denoted by A^ + i 
can be assessed with: 



< A K+1 = KL(vr || p K ) - KL(tt || p K+1 ) 

(20) 

= -0E V [A K+1 (z; k+1 ) - A K (z; K )] + log 
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The expectation with respect to the uniform can in general be calculated 
analytically whereas the ratio of normalizing constants log ^nf^r (Equation 
OS}) is a direct output of the Sequential Monte Carlo sampling that is used to 
sample from the augmented densities and is discussed in the next section. 

2.3 Adaptive Sequential Monte Carlo 

The learning scheme proposed relies on efficient computations of the gradi- 
ents appearing in Equation (TH1) . These depend on expectations with respect 
to p(z | 0) (Equation (jUJ)) which is not available analytically since the actual 
free energy A(z) is unknown. We resort to a Monte Carlo scheme that draws 
samples from the joint density p(q, z | 6) in Equation OS]) which involves the 
atomic degrees of freedom q. A brute-force approach would generally be inef- 
ficient as simulating atomistic trajectories suffers from well-known difficulties 
such as the high-dimensionality of q, the disparity in scales between q and z 
and the presence of several metastable wells [22]. Furthermore, since p(q, z | 0) 
depends on 6, new samples would have to drawn every time 6 changes after 
each iteration of the optimization routine. 

For these reasons, we propose a parallelizable strategy that relies on Sequen- 
tial Monte Carlo samplers (SMC, [TolfTB"] ). These offer a general statistical 
perspective that unifies a range of pertinent schemes that have been proposed 
in the context of non-equilibrium path sampling following Jarzynski's work 
[26P8] . We propose novel extensions that allow the algorithm to automati- 
cally adapt to the difficulties of the target density. They retain the ability to 
interact seamlessly with legacy, molecular dynamics simulators. 
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The proposed SMC schemes offer a flexible framework for sampling from a se- 
quence of unormalized probability distributions and are therefore highly suited 
for the dynamic setting of the problem at hand where the target density 
p(q, z | 0) changes with 6. For a given 6, they approximate p(q, z | 0) with a 
set of n random samples (or particles) {c^ l \ zW}" =1 , which are updated using a 
combination of importance sampling, resampling and MCMC-based rejuvena- 
tion mechanisms [H] . Each of these particles is associated with an importance 
weight which is proportional to p(qW,zW | 0). The weights are updated 
sequentially along with the particle locations in order to provide a particulate 
approximation: 

n 

p(q,z|0)^£ jy»5 q(l) (q)5 z(l) (z) (21) 

i=l 

where W {i) = w w /Ef= 1 w( k > are the normalized weights and 5 x (i)(.) is the 
Dirac function centered at x". These particles and weights can be used to 
estimate expectations of any p(q, z | 0)-integrable function [T3f§] . In particular 
for Equation (fTij) : 

^ Kj(z) ->■ / fC,(z) p(q, z | 0) dqrfz = -Ep( z |e) [Kj(z)] (almost surely) 

i=l 

(22) 

The proposed SMC algorithms will be used iteratively, after each step of the 
gradient descent algorithm. Given two successive estimates Or,™, and 6 K ^ m+ i 
(Equation ffT8]) ) and a particulate approximation of p(q, z | Ok,™,), the goal 
is to obtain new samples from p(q, z | ^,m+i) (Algorithm [2]) and compute 
the new expectations in Equation f[T4"|) based on Equation f[2~2"j) . The quality of 
the Monte Carlo estimates in Equation ( )22|) depends on the proximity of the 
distributions p(q, z | 0R- !m ) and p(q, z | 0fir im+ i). We propose building a path 
of intermediate, unormalized distributions that will bridge this gap based on 
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Equation (JSJ) 







7r 7 (q, z) oc p(q, z | (1 - i)0 K ,m + 1&K ,m+l J 



(23) 



exp{-/3(V(q,z)-A(z;0 7 ))}, 7 6[0,1] 



where: 



1 = (1 - j)0 K ,m + 7^X,m+l 



(24) 



Clearly for 7 = one recovers p(q, z | #x,m) and for 7 = 1, p(q, z | #A- im +i). 
The role of these auxiliary distributions is to provide a smooth transition 
path where importance sampling can be efficiently applied. Naturally, the 
more intermediate distributions are considered along this path, the higher the 
accuracy of the final estimates, but also the higher the computational cost. 
On the other hand too few intermediate distributions 7r 7 can adversely affect 
the overall accuracy of the approximation. 

To that end we propose an adaptive SMC scheme that automatically deter- 
mines the number of intermediate distributions needed [T4"f3T| . In this process 
we are guided by the Effective Sample Size (ESS, [36]). In particular, let S 
be the total number of intermediate distributions (which is unknown a pri- 
ori) and 7 S , s — 1, 2, . . . , S the associated reciprocal temperatures such that 
= 71 < 72 < ••• < 7s = 1, which are also unknown a priori. Let also 
{(q^,z^), W^}f =l denote the particulate approximation of 7r 7s defined as 
in Equation (l23il for 7 = 7 S . The Effective Sample Size of these particles is then 
defined as ESS, = l/E*=i(^ W ) 2 and provid es a measure of the population 
variance. One extreme, i.e. when ESS S = 1, arises when a single particle has 
a unit normalized weight whereas the rest have zero weights and as a result 
subscripts K and m indicating the number of kernels and optimization iterations 
respectively have been dropped 
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provide no information. The other extreme, i.e. ESS S = N, arises when all the 
particles are equally informative and have equal weights W} 1 ' = 1/N. 

If the next bridging distribution vr 7s+1 is very similar to 7r 7s (ie. 7 s+ i ~ j s ), 
then ESS s+ i should not be that much different from ESS S . On the other hand if 
that difference is pronounced then ESS s +i could drop dramatically. Hence in 
determining the next auxiliary distribution, we define an acceptable reduction 
in the ESS, i.e. ESS s+ i > ( ESS S (where £ < 1) and prescribe 7 s+ i (Equation 
( 123]) ) accordingly. 

The proposed adaptive SMC algorithm is summarized in Algorithm [TJ It 
should be noted that unlike MCMC schemes, the particle perturbations in 
the Rejuvenation step do not require that the P s (., .) is ergodic [T5]. It suffices 
that it is a ir ls -invariant kernel, which readily allows adaptively changing its 
parameters in order to achieve better mixing rates. In the examples presented 
a Metropolized Gibbs scheme was used to sample q and z separately by em- 
ploying a Metropolis- Adjusted Langevin Algorithm (MALA) for each set of 
coordinates [H]. In particular given (<?a-ij z s-i) these consist of: 

• Updating — > q®: 

• Updating — > z$: 

zf - zf_, = ^w^M\ *&) + V^r z r z 

(27) 

= (V z U(qJ2 1 , zfl,) - VJizfl,; ft,.)) + jM q r q 



= ^V q vr 7s (q2 1 , 2 2 1 ) + v / ^^ 

(26) 

= -^V^Viq^z^) + jAT q r q 
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Algorithm 1 Adaptive SMC 



Require: s — 1 and 71 = and a population Zi), w^}^ which 

approximate 7r 7l = p(q, z | Ok,™) in Equation ( 1231) 

Ensure: The final population {{6^ , d^) , w^}f =l provides a particulate ap- 
proximation of 7r 7s in the sense of Equations (12T1) . (|22|) . 
while 7 S < 1 do 
s <- s + 1 

{Reweighting-Importance Sampling} 

Let 



>s — l^s — 1' s — l' 



(25) 



1 "pj-^V^,^!)-^,^.!)} 

= exp (7s ~ 7s-l)(0RT,m+l - 0ft>i)} 

be the updated weights as a function of 7 S . Determine 7 S G (7^-1, 1] so 

that ESS S = CESS s _x . 

{Resampling} 

if ESS, < ESS min then 

Resample 
end if 

{Rejuvenation} 

Use an MCMC kernel P s {{q^-xi z s-i)> (Qs \ z i^)) that leaves 7r 7s invariant 
to perturb each particle (qr^— 1? z< s-i) ~ * (<Zs*\ z ^)- 
end while 

where r q and r z are i.i.d standard Gaussian vectors. A Metropolis accept/reject 
step with respect to the target invariant density vr 7s (.) was performed after 
each update. Two different time steps were used At q and At z for the q and 
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z coordinates respectively. Their values were adjusted after each iteration s 
so as to retain an average acceptance ratio (over all particles n) between 50% 
and 80% [H]. The adaptivity afforded by the proposed scheme relies on the 
fact that ergodicity is not required from the rejuvenation step. As a result 
several MALA time steps can be performed in Equations (1261) and ff27|) (at 
additional computational expense) or other molecular dynamics samplers can 
be employed which could potentially exhibit better mixing or fit more closely 
to the physics of the problem at hand [6]. 

Finally we note that the estimates of the ratio of normalization constants 
Z s /Z s _i between two successive unormalized densities vr 7s _ 1 and 7r 7s can be 
obtained by averaging the unormalized updated weights in Equation fl25l) as 
a direct consequence of the importance sampling identity: 

Zs—l J' 7r 7s _ 1 (q,z) dqdz 

= J ^ ( r } "V (q ' z) dqfr (28) 

~ l^i=l vv s-l TIT) U) : 

These estimators can be telescopically multiplied ( [T5|30] ) in order to com- 
pute the ratio of normalization constants between any pair of distributions as 
required in Equation (T20"|) . 

Given the preceding discussion in sections 12.11 12.21 and 12. 3^ we summarize 
below the basic steps in the proposed free energy computation scheme: In the 
inner loop and for fixed K, gradient descent (subsection I2.ip is performed 
which makes use of the adaptive SMC scheme (subsection I2.3j) in order to 
compute the expectations in the gradient. In the outer loop, the cardinality 
of the expansion K is increased by adding one kernel (i.e. K <— K + 1) based 
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on Equation (1191) . This is terminated when the KL gain (Equation ( 120]) ) does 
not exceed a prescribed tolerance. 



Algorithm 2 Calculation of the free energy at a given temperature. 
Require: K = 0, 6q = and a particulate approximation of p(q, z | 6q) 

(Equation at the desired temperature (3. 

while true do 

Calculate A K based on Equation (120]) . 

if Ak < tol then 
Break the loop. 

else 

Add the optimal (K + l) th kernel based on Equation ( fl9l) and set K <r- 

K + l 

repeat 

Estimate gradient at Or,™ and calculate update Of^m+i based on 
Equation (TTg]) 

Use adaptive SMC (section |2 .3 j) to construct particulate approxima- 
tion of p(q, z | K ,m+i) from p(q, z | K ,m)- 
until Convergence criteria are met. 
end if 
end while 



2.4 Obtaining the free energy landscape for various temperatures. 

The methodology described in the previous sections is suitable for calculating 
the free energy as a function of z at a given temperature. However, one is often 
interested in the free energy landscape as a function of the temperature also. 
In order to achieve this goal we make use of the following two facts. Firstly, 



19 



the free energy landscape at higher temperatures is flatter and secondly that 
nearby temperatures have similar free energies landscapes. Based on these, we 
propose a natural extension to the sequential sampling framework of subsec- 
tion 12.31 that can efficiently produce estimates of the free energy at various 
temperatures. The idea is to start from a higher temperature, compute the 
free energy as described before, then gradually move towards lower temper- 
atures using the free energy of the previous temperature as an initial guess 
for the new one. In particular given the free energy estimate A^ 1 (z;0(^i)) 
and the particulate approximation of the joint density ^(q, z | 0(/?O) at a 
temperature we propose employing the aforementioned adaptive SMC 

in order to obtain a particulate approximation of the following joint density 
at > Pi (i.e. for lower temperature): 

^ 2 (q,z | 0(00) ex exp{-/3 2 (V(q,z) - i ft (z; 0(/?O)) } (29) 

The iterations enumerated in Algorithm [2] can then be carried out in the 
same fashion by updating the existing as well as adding new kernels if the 
convergence criteria are not satisfied. 

The critical step involves building a sequence of distributions from (q, z | 
0(/3i)) to £>/3 2 (q, z | 0(/?O) i n Equation ff29l . For this purpose and similarly to 
a simulated annealing schedule we employ: 

7r 7 (q, z) oc exp {-((1 - 7 )ft + 7 /? 2 ) (V(q, z) - A^(z; 0(ft))) } (30) 

The steps in Algorithm [1] should be adjusted to the aforementioned sequence 
of bridging distributions with the most striking difference in the Reweighing 
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step where the updated weights in Equation (1251) should now be given by: 

= exp {-( Ts - 7 _ 1 )(& - A) (V(q£i,*2i) " W))} 

(31) 

We demonstrate the efficacy of such an approach in the last example of section 
[31 It is finally noted that at the beginning of iterations at each new tempera- 

■ 

ture, kernels with very small weights 0, were removed if — 3 —^- < 0.01. 

2.5 The reaction coordinate case. 

The proposed method was described for the alchemical case. However, it is 
straightforwardly generalized to cover also the general reaction coordinate 
case. Let £ : M. — >■ T> be a function of the system coordinates. This function 
is called a reaction coordinate [33]. It is evident that q can be viewed in a 
probabilistic framework as a random variable and so: 

z = |(g) (32) 

is also a random variable. The probability distribution of z can be found by 
integrating out the q: 

p(z | 0) = J p(q)5(£(q) - z)dq oc J exp (-/3V(o)) - *)dq (33) 

The free energy A(z) with respect to the reaction coordinate £(q) is defined 
to be the effective potential of z = £(qr) 

p(z) oc exp (-/L4(z)) (34) 
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Combining these two equations we see that: 

A(z) = -/T 1 log I exp (-PV(q)) 5(£(q) - z)dq 



(35) 



If A(z; 0) is an estimate of A(z), we define a new probability distribution over 
q by: 

p(q\0) oc exp (-/3(V(q) - A(£(q); 0))) (36) 

It is straight forward to see that under this new distribution for q, the pdf of 
z becomes: 

p(z\0) = J p(q\0)5(£(q) - z)dq cc l v (z) exp (-(3(A(z) - A(z; 0))) (37) 

This coincides with the expression in Equation ([9]) and therefore the ensuing 
derivations hold identically. From a practical point of view, sampling need 
only performed in the q space and therefore the adaptive SMC schemes are 
employed to obtain particulate approximations of the density in Equation (136j) . 
The only difference appears in the MCMC-based Rejuvenation step where the 
MALA sampler is employed only with regards to q. In particular the update 
of Equation fl2"61) now becomes: 

qf - «& = ^VqTT^l) + jAf q r q 

(38) 

= (V q y(<£i) - #V q £(q)) + jAT q r q 

It is noted that, in contrast to some ABF methods which require second-order 
derivatives of £ [21] , the proposed technique only needs first-order derivatives. 
Finally, we point out that the ability of the proposed approach to provide 
efficiently estimates of parametrized free energy surfaces (as in section 12.41 
with respect to the temperature /3), can also be exploited in the reaction 
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coordinate case by defining a joint density: 

p(q, z | 0) oc exp (V(q) + | || z - £(q) f -i M (z; 0)) } (39) 

where as in [37] an artificial spring with stiffness fi has been added. Clearly for 
fi — > oo one recovers the aforementioned description, but for all other values 
of fi the formulation reduces to that of Equation ([5]) where in place of V(q, z) 
we now have V(q) + | || z — £(q) || 2 . One can therefore obtain free energy 
surfaces for various /i values. For smaller fi the free energy would be flatter 
and in the extreme case of \i — it would be constant. As fi increases, the 
complexities of the free energy surface would become pronounced. Hence by 
exploiting the idea of section 12. 4[ a sequence of problems parametrized by [i 
rather than (3, can be constructed to gradually move to larger fi values by 
using the free energy of the previous fi as an initial guess for the new one. 
The adaptive SMC scheme would ensure a smooth enough transition while 
retaining a good level accuracy for the approximations obtained. 

3 Numerical Examples 

3. 1 Two-Dimensional Toy Example 

Consider a two-dimensional system [5^fM] with a single parameter z, inter- 
acting with potential energy 

V(q; z) = cos(27rz)(l + diq) + d 2 q 2 

Assume that q given z and (3 is distributed according to 

p(q\z, /3) oc exp (-/3V(q; z)) 
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where (3 is also a fixed parameter that plays the role of an inverse temperature. 
We wish to calculate an approximation A(z) of the free energy A(z) on an 
interval D = [—0.5, 0.5]. The true free energy can be found analytically to be 

... . d\ cos(2itz) 2 

A(z) = cos(2ttz) \ — + c 

4d 2 

where c is a constant that depends upon the specific choice of the fixed pa- 
rameters. In what follows, we choose c so that A{— 0.5) = 0. 

To demonstrate our method in this simple example we used d\ = 2, d% = 30. 
The potential energy V(q; z) for this choice of the parameters is depicted in 



Figure 1(a) We fix the inverse temperature to (3 = 10. As shown in Figure 



1(b), the distribution is bimodal with a big region of practically zero prob- 
ability separating the two modes. Hence, metastability along the parameter 
z is apparent. The performance of the proposed method with respect to the 
number of particles used in the adaptive SMC scheme is depicted in Figures 
[2] and [3] which show the evolution of the estimated free energy landscape with 
n = 100 and n — 10, 000 particles respectively. In both cases the method is 
capable of capturing the correct characteristics of the reference solution and 
as expected the noise in the computations is less when the number of parti- 
cles is larger. In both cases the Robbins-Monro learning series is picked to be 
r] m = m~ v with p = 0.6 and the learning rate A = 0.1. 



Figure 4(a) shows the first three kernels selected by the greedy scheme of 



section 12.21 Figure 4(b) depicts the log- values of the kernel weights {9j}f =l 
which clearly demonstrate the ability of the proposed approach to provide 
sparse approximations. The first kernel selected has the greatest weight and 
hence it contains the majority of the information about the free energy curve. 
The rest of the kernels are progressive corrections of the estimate given by the 
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(a) The potential energy V(qi,q2) for (b) The probability distribution 
d x = 2,d 2 = 30 p(?i,«2|/9) for di = 2,d 2 = 30, >S = 10 

Fig. 1. Potential energy and pdf for the toy example of section [3~T1 



first kernel. This conclusion is also supported by the result of Figure which 
shows the evolution of the reduction in the KL divergence with respect to the 
total number of iterations as quantified by adding the A^+i in Equation (120]) . 
Clearly the first kernel offers the greatest KL gain (Ai) and further kernel 
additions offer progressively smaller reductions in the KL divergence. 



3.2 WCA Dimer 



We consider N = 16 atoms in a two-dimensional fully periodic box of side / 
which interact with a purely repulsive WCA pair potential [M] : 



4e 




(?) 



12 



+ e , if r > r 
, otherwise 



where a and e give the length and energy scales respectively. Two of these 
atoms (say atoms 1 and 2) are assumed to form a dimer and their interaction 
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reference 



-0.4 -0.2 



(a) K=l 



0.2 0.4 




-0.4 -0 



(b) K=2 



0.2 0.4 





(c) K=3 (d) K=8 

Fig. 2. Evolution of free energy estimates at various kernel numbers K when usin^ 
n = 100 particles in the adaptive SMC scheme 

is described instead with a double well potential: 



V s {r) = h 



(r — r — w)' 



where h,w,ro are fixed parameters and r the distance between them. This 
potential has two equilibrium points r and r + 2w. The barrier separating 
the two equilibria is h. 



Let q = (<7i, 92, • • • j Qn) with qi e M 2 denoting the position of atom i. The 
potential energy of the system is: 



2 N 



V{q) = V s {\q 1 -q 2 \) + ^2J2 V wcA{\qi-q j \)+ E HvcAd^-^l) 

i=l j=3 2<i<j 
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(c) K=3 (d) K=8 

Fig. 3. Evolution of free energy estimates at various kernel numbers K when using 
n = 10, 000 particles in the adaptive SMC scheme 

We consider an NVT ensemble (the volume V is determined by the side of the 
box I). The probability distribution of the atomic positions q is 

p(q\/3)ocexp(-/3V(q)) 

where (3 = -j^f, &b is the Boltzman constant and T the temperature of the 
system. Under these assumptions atoms 1 and 2 will form a dimer with two 
equilibrium lengths. An effective potential of the dimer length in the presence 
of the other atoms is given by the free energy A(r) with respect to the reaction 
coordinate 

z = £(q) =11 Qi ~ <?2 h 
where || ■ lU is the Euclidean norm of 1R 2 . 
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(a) First three kernels Kj(.) picked by 
the greedy optimization scheme to il- 
lustrate selection of location and band- 
width parameters (n = 10, 000) 

Fig. 4. Kernels selected and kernel wei; 




kernel number j 



(b) Log absolute weights (log | 9j |) of 
the first 10 kernels added. The 6 value 
of the first kernel is over one order of 
magnitude larger than the rest 
;hts obtained with n = 10, 000 particles 




2 - 

I , 1 , 1 , 1 , 1 , 1 , 

5000 10000 15000 . 20000 25000 30000 

iterations 

Fig. 5. Evolution of the reduction in the KL divergence at various K with respect tot 
the total number of iterations performed. Upon addition of each kernel a reduction 
Ak (Equation (|20p ) is achieved which becomes progressively smaller. 

We calculate A(z) using our scheme for two different box sizes (densities): 
I = 4 (high density) and I = 12 (low density). The parameters are set to 
N = 16 atoms, /3 = 1, e = 1, a — 1, h — 1, w — 0.5. We employed n = 500 
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(a) Low Density I = 12 (b) High Density I = 5 

Fig. 6. The free energy of the dimer at two different densities compared with Vs(r). 



Notice that at low density (a) the right well becomes the most probable. This 
situation is reversed at high density |(b)| 

particles and the Robbins-Monroe learning series is again r] m = m~ p with 
p = 0.501 and A = 0.1. The resulting free energy curves at various stages of 
the estimation process with increasing number of kernels are shown in Figure 



|6j We notice that at low density i.e. when the box size is I = 12 (Figure 6(a) ), 
the equilibria move to the right with the well closest to r + 2w becomes the 
most probable. Furthermore the free energy barrier is slightly decreased as 



compared to the high density case when / = 5 (Figure 6(b)). Under these 
conditions the equilibria move to the left and the well closest to r becomes 
the most probable. 
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3.3 38- Atom Lennard- Jones Cluster (LJ^) 



We consider a 38-atom cluster in 3-dimensional space with pairwise interac- 
tions given by the Lennard- Jones potential: 



^Lj(r) 



a\ 12 



(40) 



with e and a playing the role of energy and length scale respectively. Let the 
Cartesian coordinates of the system be 

q= (q 1 ,...,q 38 ),q i £« 3 (41) 



Then the potential energy of the system is 

nq) = ^E^j(|q,.-q,|) 
Finally we assume that the particles follow an NVT distribution of the form 



p(q|/3)ocexp{-/3y(q)} 



where (3 = 1/ksT. At zero temperature the system is known to have a global 



minimum yielding an FCC truncated octahedron (Figure 7(b)). The second 



and third lower energies give incomplete Mackey icosahedra. Furthermore 
there is a big number of liquid-like local minima ( |19|lo] ). 



Consider the family of order parameters initially introduced in |47j : 

2 



Qi 



An 



21 + 



7 \Qlm\ 

J- ~- 7 



(42) 



rn=—l 



with 



Qlm - 77 X! Ylm{9ij , 4>ij) 
7V & r«<ro 

where the sum is over all the iV& pairs of atoms with = |qj — q 3 -| < r , 
Yim{0, 4>) is a spherical harmonic, while and <pij are the polar and azimuthal 
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(a) Qi « 0.01 (b) Q4 « 0.19 (truncated octahedron) 

Fig. 7. Indicative metastable states corresponding to the two wells of the free energy 
landscape with respect to order parameter Q4 (Equation 



angles of a bond vector with respect to an arbitrary coordinate system. In [5] 
it is shown that for I — 4, Q4 can distinguish the FCC structure but not 



the icosahedral and liquid-like minima (Figure 7(a)). However, if one also 
considers the energy the two structures are well-separated. Hence, we define 
the two dimensional reaction coordinate: 

£(q) = (Q4(q),F(q)) 

we compute the free energy 

A(Q 4 , E) = /T 1 / exp {-(3V(q)} 5(Q 4 - Q 4 (q))5(E - V(q))dq 
over the domain 

D = [0,0.2] x [-175e,-145e] 

for a temperature range ksT = 0.21 to ksT = 0.091 using the temper- 
ing scheme described in Section 12.41 We employ n = 100 particles and 10 
MCMC/Rejuvenation steps per particle. At each = ksT, the Robbins- 
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Monro learning series was adjusted to r\ m = m~ p with p = 0.501 and a learning 
rate A = 0.1//3. The adaptive SMC scheme automatically determined 260 in- 
termediate steps/distributions in order to cover the whole range of the afore- 
mentioned temperatures. While the time step At q employed in the MALA 
sampler was adaptively adjusted as discussed previously and took values be- 
tween 10~ 4 (low temperatures) and 7 x 10~ 4 (high temperatures). The very 
first step, at T = 0.21 (f3 = 4.76) required 12,000 optimization iterations to 
converge with a cost of approximately 7.2 x 10 5 time steps per particle. It is 
emphasized that due to the embarrassingly parallelizable nature of the SMC 
scheme employed , each particle can be simulated in a different CPU, largely 
independently of the rest. The sequence of intermediate /3's determined auto- 
matically by the scheme discussed in section 12.41 is depicted in Figure El The 
similarity of the free energy surfaces at neighboring temperatures allowed us 
to converge with, on average, 800 optimization iterations at each intermediate 
f3. The overall cost was 2.4 x 10 4 time steps per particle, i.e. equivalent to 
1/30 of the cost for calculating the free energy from scratch at the initial j3. 
The free energy surfaces computed are depicted in Figure [9] at four indicative 
temperatures. The number of kernels selected by the algorithm varied between 
90 and 120. As it has been reported in previous studies [5], we identified two 
metastable states at Q4 ~ 0.01 which corresponds to the truncated octahedron 
and at Q4 ~ 0.19 which corresponds to the icosahedron. The latter becomes 
more pronounced at lower temperatures. 

In order to assess the quality of the results in two dimensions we also calculated 
the free energy profile using only Q4 as the reaction coordinate (Figure [TO]) . 
We also performed a numerical integration of the two-dimensional free energy 
surface i.e. by computing A(Qi) = —(3~ l \ogj e~ /3A ^ 4 " ES> dE with respect to 
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Fig. 8. Sequence of intermediate /3's identified by the scheme discussed in section 
12.41 for the L cluster. The free energy landscape was calculated at each of these 
temperatures by efficiently updating the free energy surface at the previous step, 
the second reaction coordinate. The two free-energy curves are depicted in 
Figure [TD] where good agreement is observed at two different temperatures. 



4 Conclusions 



In summary, the proposed method provides a unifying framework for estimat- 
ing the free energy function simultaneously with biasing the dynamics. The 
minimization of the Kullback-Leibler divergence in the extended space pro- 
vides rigorous convergence bounds and diagnostics. It requires minimal ad- 
justment of parameter values a priori (basically only the learning rate A and 
convergence tolerances) as it is adaptive and automatically promotes sparse 
representations of the free energy surface. It offers several possibilities for fur- 
ther improvements by considering different optimization schemes (e.g. noisy 
conjugate gradients) and employing different basis functions (e.g. wavelets). 
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0.05 Oi. 0.15 0.2 0.05 OA 0.15 0.2 

(a)T = 0.21 (b)T = 0.17 




(c)T = 0.14 (d)T = 0.11 

Fig. 9. Free energy contours A{Q^,E) with respect to the two reaction coordinates 
Qi (x-axis) and E (y-axis) at various temperatures for LJ^s 
Its sequential nature allows the efficient computation of a family of free en- 
ergy surfaces at different temperatures. We believe that these features make 
the proposed approach suitable to calculate the free energy of systems more 
physically challenging than the ones discussed in this paper. 
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Fig. 10. Free energy profiles A(Q^) obtained using the proposed scheme while using 
only reaction coordinate (dashed lines) and by integrating the two dimensional free 
energy surface i.e. A(Q^) = — /3 _1 log J e~^ A ^ i,E ^dE (solid lines) for two tempera- 
tures T = 0.19 and T = 0.14 
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